Differentiable Search Indices (DSIs) encode a corpus of documents in the parameters of a model and use the same model to map queries directly to relevant document identifiers. Despite the strong performance of DSI models, deploying them in situations where the corpus changes over time is computationally expensive because reindexing the corpus requires re-training the model. In this work, we introduce DSI++, a continual learning challenge for DSI to incrementally index new documents while being able to answer queries related to both previously and newly indexed documents. Across different model scales and document identifier representations, we show that continual indexing of new documents leads to considerable forgetting of previously indexed documents. We also hypothesize and verify that the model experiences forgetting events during training, leading to unstable learning. To mitigate these issues, we investigate two approaches. The first focuses on modifying the training dynamics. Flatter minima implicitly alleviate forgetting, so we optimize for flatter loss basins and show that the model stably memorizes more documents (+12\%). Next, we introduce a generative memory to sample pseudo-queries for documents and supplement them during continual indexing to prevent forgetting for the retrieval task. Extensive experiments on novel continual indexing benchmarks based on Natural Questions (NQ) and MS MARCO demonstrate that our proposed solution mitigates forgetting by a significant margin. Concretely, it improves the average Hits@10 by $+21.1\%$ over competitive baselines for NQ and requires $6$ times fewer model updates compared to re-training the DSI model for incrementally indexing five corpora in a sequence.
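For readers unfamiliar with the metric quoted above, Hits@10 is the fraction of queries whose gold document identifier appears among the model's top 10 predicted identifiers. The following is a minimal sketch of that computation; the function names and the toy data are hypothetical and are not taken from the DSI++ evaluation code.

```python
def hits_at_k(ranked_ids, gold_id, k=10):
    """Return 1.0 if the gold document id is among the top-k ranked ids, else 0.0."""
    return 1.0 if gold_id in ranked_ids[:k] else 0.0


def mean_hits_at_k(predictions, gold, k=10):
    """Average Hits@k over all queries.

    predictions: dict mapping query id -> list of document ids, best first
    gold:        dict mapping query id -> the single relevant document id
    (Both structures are assumptions made for this illustration.)
    """
    scores = [hits_at_k(predictions[q], gold[q], k) for q in gold]
    return sum(scores) / len(scores)


# Toy data: three queries, three ranked candidates each.
predictions = {
    "q1": ["d3", "d1", "d7"],
    "q2": ["d2", "d9", "d4"],
    "q3": ["d5", "d6", "d8"],
}
gold = {"q1": "d1", "q2": "d4", "q3": "d0"}

hits_at_2 = mean_hits_at_k(predictions, gold, k=2)
hits_at_3 = mean_hits_at_k(predictions, gold, k=3)
```

In a continual indexing setting, the same average would be tracked separately for previously and newly indexed documents to expose forgetting.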
Large language models (LLMs) have shown impressive results across a variety of tasks while requiring little or no direct supervision. Further, there is mounting evidence that LLMs may have potential in information-seeking scenarios. We believe the ability of an LLM to attribute the text that it generates is likely to be crucial for both system developers and users in this setting. We propose and study Attributed QA as a key first step in the development of attributed LLMs. We develop a reproducible evaluation framework for the task, using human annotations as a gold standard and a correlated automatic metric that we show is suitable for development settings. We describe and benchmark a broad set of architectures for the task. Our contributions give some concrete answers to two key questions (How to measure attribution? and How well do current state-of-the-art methods perform on attribution?), and give some hints as to how to address a third key question (How to build LLMs with attribution?).
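There has been a lot of interest in the scaling properties of Transformer models. However, not much has been done to investigate how different inductive biases and model architectures affect those scaling properties. Do model architectures scale differently? If so, how does inductive bias affect scaling behaviour? How does this influence upstream (pretraining) and downstream (transfer) performance? This paper conducts a systematic study of the scaling behaviour of ten diverse model architectures such as Transformers, Switch Transformers, Universal Transformers, Dynamic Convolutions, Performers, and the recently proposed MLP-Mixers. Via extensive experiments, we show that (1) architecture is indeed an important consideration when performing scaling and (2) the best performing model can fluctuate at different scales. We believe that the findings outlined in this work have significant implications for how model architectures are currently evaluated in the community.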
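Recent advances in Transformer-based large language models (LLMs) have led to significant performance improvements across many tasks. These gains come with a drastic increase in model size, potentially leading to slow and costly use at inference time. In practice, however, the series of generations made by LLMs is composed of varying levels of difficulty. While certain predictions truly benefit from the model's full capacity, other continuations are more trivial and can be solved with reduced compute. In this work, we introduce Confident Adaptive Language Modeling (CALM), a framework for dynamically allocating different amounts of compute per input and generation timestep. Early exit decoding involves several challenges that we address here, such as: (1) what confidence measure to use; (2) connecting sequence-level constraints to local per-token exit decisions; and (3) attending back to missing hidden representations due to early exits in previous tokens. Through theoretical analysis and empirical experiments on three diverse text generation tasks, we demonstrate the efficacy of our framework in reducing compute, with a potential speedup of up to $\times 3$, while provably maintaining high performance.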
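The per-token exit decision described above can be sketched as follows. This is an illustration only, not the CALM implementation: it assumes one possible confidence measure (the gap between the two largest softmax probabilities), a single fixed threshold, and a hypothetical list `per_layer_logits` holding the vocabulary logits that each decoder layer would produce for the current token. It does not address the sequence-level calibration or the missing hidden states mentioned in challenges (2) and (3).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top2_gap(logits):
    """Confidence score: difference between the two largest probabilities."""
    probs = sorted(softmax(logits), reverse=True)
    return probs[0] - probs[1]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


def early_exit(per_layer_logits, threshold):
    """Return (layer_index, token_id) for the first layer that is confident enough.

    Falls back to the final layer when no earlier layer reaches the threshold.
    """
    for layer, logits in enumerate(per_layer_logits):
        if top2_gap(logits) >= threshold:
            return layer, argmax(logits)
    last = len(per_layer_logits) - 1
    return last, argmax(per_layer_logits[last])


# Hypothetical logits over a 3-token vocabulary from a 3-layer decoder.
per_layer_logits = [
    [0.1, 0.2, 0.0],  # layer 0: nearly uniform, low confidence
    [0.0, 3.0, 0.5],  # layer 1: token 1 clearly ahead
    [0.0, 6.0, 0.0],  # layer 2: very confident
]

easy = early_exit(per_layer_logits, threshold=0.7)
strict = early_exit(per_layer_logits, threshold=0.999)
```

A lower threshold lets more tokens exit early and saves compute, at the cost of a higher risk that the early prediction differs from the full model's.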
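Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. This paper instead discusses an unpredictable phenomenon that we refer to as emergent abilities of large language models. We consider an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models. The existence of such emergence implies that additional scaling could further expand the range of capabilities of language models.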
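Prompt-Tuning is a new paradigm for fine-tuning pre-trained language models in a parameter-efficient way. Here, we explore the use of HyperNetworks to generate hyper-prompts: we propose HyperPrompt, a novel architecture for prompt-based task-conditioning of self-attention in Transformers. The hyper-prompts are end-to-end learnable via generation by a HyperNetwork. HyperPrompt allows the network to learn task-specific feature maps where the hyper-prompts serve as task global memories for the queries to attend to, while at the same time enabling flexible information sharing among tasks. We show that HyperPrompt is competitive against strong multi-task learning baselines with as few as $0.14\%$ additional task-conditioning parameters, achieving great parameter and computational efficiency. Through extensive empirical experiments, we demonstrate that HyperPrompt can achieve superior performance over strong T5 multi-task learning baselines and parameter-efficient adapter variants, including Prompt-Tuning and HyperFormer++, on the natural language understanding benchmarks GLUE and SuperGLUE across many model sizes.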
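We argue that current IR metrics, modeled on optimizing user experience, measure too narrow a portion of the IR space. If IR systems are weak, these metrics undersample or completely filter out the deeper documents that need improvement. If IR systems are relatively strong, these metrics undersample deeper relevant documents that could underpin even stronger IR systems, ones that could present content from tens or hundreds of relevant documents in a user-digestible hierarchy or text summary. We reanalyze over 70 TREC tracks from the past 28 years, showing that roughly half undersample top-ranked documents and nearly all undersample tail documents. We show that in the 2020 Deep Learning tracks, neural systems were actually near-optimal on top-ranked documents, compared with only modest gains over BM25 on tail documents. Our analysis is based on a simple new systems-oriented metric, "atomized search length", which is capable of accurately and evenly measuring all relevant documents at any depth.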
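Despite the recent success of multi-task learning and transfer learning for natural language processing (NLP), few works have systematically studied the effect of scaling up the number of tasks during pre-training. Towards this goal, this paper introduces ExMix (Extreme Mixture): a massive collection of 107 supervised NLP tasks across diverse domains and task families. Using ExMix, we study the effect of multi-task pre-training at the largest scale to date, and analyze co-training transfer amongst common families of tasks. Through this analysis, we show that manually curating an ideal set of tasks for multi-task pre-training is not straightforward, and that multi-task scaling can vastly improve models on its own. Finally, we propose ExT5: a model pre-trained using a multi-task objective of self-supervised span denoising and supervised ExMix. Via extensive experiments, we show that ExT5 outperforms strong T5 baselines on SuperGLUE, GEM, Rainbow, Closed-Book QA tasks, and several tasks outside of ExMix. ExT5 also significantly improves sample efficiency during pre-training.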
Transformers do not scale very well to long sequence lengths, largely because of quadratic self-attention complexity. In recent months, a wide spectrum of efficient, fast Transformers has been proposed to tackle this problem, more often than not claiming superior or comparable model quality to vanilla Transformer models. To date, there is no well-established consensus on how to evaluate this class of models. Moreover, inconsistent benchmarking on a wide spectrum of tasks and datasets makes it difficult to assess relative model quality amongst many models. This paper proposes a systematic and unified benchmark, Long-Range Arena, specifically focused on evaluating model quality under long-context scenarios. Our benchmark is a suite of tasks consisting of sequences ranging from 1K to 16K tokens, encompassing a wide range of data types and modalities such as text, natural and synthetic images, and mathematical expressions requiring similarity, structural, and visual-spatial reasoning. We systematically evaluate ten well-established long-range Transformer models (Reformers, Linformers, Linear Transformers, Sinkhorn Transformers, Performers, Synthesizers, Sparse Transformers, and Longformers) on our newly proposed benchmark suite. Long-Range Arena paves the way towards better understanding this class of efficient Transformer models, facilitates more research in this direction, and presents new challenging tasks to tackle. Our benchmark code will be released at https://github.com/google-research/long-range-arena.
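To make the quadratic cost concrete: full self-attention compares every token with every other token, so a sequence of n tokens yields an n x n score matrix per head. The short sketch below counts those entries at the two ends of the 1K to 16K range mentioned above. The helper name is hypothetical, and the count ignores constant factors such as the head dimension and the number of layers.

```python
def attention_score_entries(seq_len, num_heads=1):
    """Number of query-key scores computed by full self-attention in one layer."""
    return num_heads * seq_len * seq_len


short = attention_score_entries(1_024)    # 1K tokens
long = attention_score_entries(16_384)    # 16K tokens

# A 16x longer sequence needs 16 * 16 = 256x as many scores.
growth = long // short
```

This growth is what the efficient Transformer variants listed above try to avoid, for example through sparse, low-rank, or kernel-based approximations of the score matrix.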
The pandemic of recent years has led to a dramatic increase in people wearing protective masks in public venues. This poses obvious challenges to the pervasive use of face recognition technology, which is now suffering a decline in performance. One way to address the problem is to resort to face recovery methods as a preprocessing step. Current approaches to face reconstruction and manipulation leverage the ability to model the face manifold, but tend to be generic. We introduce a method that is specific to the recovery of the face image from an image of the same individual wearing a mask. We do so by designing a specialized GAN inversion method, based on an appropriate set of losses for learning an unmasking encoder. With extensive experiments, we show that the approach is effective at unmasking face images. In addition, we show that the identity information is preserved sufficiently well to improve face verification performance on several face recognition benchmark datasets.